Autotuning Divide-and-Conquer Matrix-Vector Multiplication
ثبت نشده
چکیده
Divide and conquer is an important concept in computer science. It is used ubiquitously to simplify and speed up programs. However, it needs to be optimized, with respect to parameter settings for example, in order to achieve the best performance. The problem boils down to searching for the best implementation choice on a given set of requirements, such as which machine the program is running on. The goal of this thesis is to apply and evaluate the Ztune approach [14] on serial divide-and-conquer matrix-vector multiplication. We implemented Ztune to autotune serial divide-and-conquer matrix-vector multiplication on machines with different hardware configurations, and found that Ztuneoptimized codes ran 1%-5% faster than the hand-optimized counterparts. We also compared Ztune-optimized results with other matrix-vector multiplication libraries including the Intel Math Kernel Library and OpenBLAS. Since the matrix-vector multiplication problem is a level 2 BLAS, it is not as computationally intensive as level 3 BLAS problems such as matrix-matrix multiplication and stencil computation. As a result, the measurement in matrix-vector multiplication is more prone to error from factors such as noise, cache alignment of the matrix, and cache states, which lead to wrong decision choices for Ztune. We explored multiple options to get more accurate measurements and demonstrated the techniques that remedied these issues. Lastly, we applied the Ztune approach to matrix-matrix multiplication, and we were able to achieve 2%-85% speedup compared to the hand-tuned code. This thesis represents joint work with Ekanathan Palamadai Natarajan. Thesis Supervisor: Professor Charles E. Leiserson Title: Edwin Sibley Webster Professor in Electrical Engineering and Computer Science 1This research was supported in part by NSF Grants 1314547 and 1533644.
منابع مشابه
Autotuning divide-and-conquer stencil computations
This paper explores autotuning strategies for serial divide-and-conquer stencil computations, comparing the efficacy of traditional “heuristic” autotuning with that of “pruned-exhaustive” autotuning. We present a pruned-exhaustive autotuner called Ztune that searches for optimal divide-and-conquer trees for stencil computations. Ztune uses three pruning properties — space-time equivalence, divi...
متن کاملDivide and conquer the Hilbert space of translation-symmetric spin systems.
Iterative methods that operate with the full Hamiltonian matrix in the untrimmed Hilbert space of a finite system continue to be important tools for the study of one- and two-dimensional quantum spin models, in particular in the presence of frustration. To reach sensible system sizes such numerical calculations heavily depend on the use of symmetries. We describe a divide-and-conquer strategy f...
متن کاملParallelization of Divide-and-Conquer by Translation to Nested Loops
We propose a sequence of equational transformations and specializations which turns a divide-and-conquer skeleton in Haskell into a parallel loop nest in C. Our initial skeleton is often viewed as general divide-and-conquer. The spe-cializations impose a balanced call tree, a xed degree of the problem division, and elementwise operations. Our goal is to select parallel implementations of divide...
متن کاملTransformation of Divide & Conquer to Nested Parallel Loops
We propose a sequence of equational transformations and specializations which turns a divide-and-conquer skeleton in Haskell into a parallel loop nest in C. Our initial skeleton is often viewed as general divide-and-conquer. The specializations impose a balanced call tree, a xed degree of the problem division, and elementwise operations. Our goal is to select parallel implementations of divide-...
متن کاملOptimizing Skeletal Stream Processing for Divide and Conquer
Algorithmic skeletons intend to simplify parallel programming by providing recurring forms of program structure as predefined components. We present a new distributed task parallel skeleton for a very general class of divide and conquer algorithms for MIMD machines with distributed memory. Our approach combines skeletal internal task parallelism with stream parallelism. This approach is compare...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2016